AITopics | terminal reward

Collaborating Authors

terminal reward

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Li, Yuming, Wang, Yikai, Zhu, Yuying, Zhao, Zhongyu, Lu, Ming, She, Qi, Zhang, Shanghang

arXiv.org Artificial IntelligenceSep-30-2025

Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to \textbf{16\%} over DanceGRPO, while reducing per-iteration training time by nearly \textbf{55\%}. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it further achieves higher Video-Align scores with sharper and temporally consistent frames compared to DanceGRPO. Codes are available at \href{https://fredreic1849.github.io/BranchGRPO-Webpage/}{BranchGRPO}.

branchgrpo, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.0604

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

Efficient Feature Mapping Using a Collaborative Team of AUVs

Biggs, Benjamin, Stilwell, Daniel J., Yetkin, Harun, McMahon, James

arXiv.org Artificial IntelligenceDec-26-2024

We present the results of experiments performed using a team of small autonomous underwater vehicles (AUVs) to determine the location of an isobath. The primary contributions of this work are (1) the development of a novel objective function for level set estimation that utilizes a rigorous assessment of uncertainty, and (2) a description of the practical challenges and corresponding solutions needed to implement our approach in the field using a team of AUVs. We combine path planning techniques and an approach to decentralization from prior work that yields theoretical performance guarantees. Experimentation with a team of AUVs provides empirical evidence that the desirable performance guarantees can be preserved in practice even in the presence of limitations that commonly arise in underwater robotics, including slow and intermittent acoustic communications and limited computational resources.

artificial intelligence, horizon path, planning & scheduling, (18 more...)

arXiv.org Artificial Intelligence

2412.19409

Country:

Europe (0.68)
North America > United States > Virginia (0.29)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)

Add feedback

Mitigating Dimensionality in 2D Rectangle Packing Problem under Reinforcement Learning Schema

Kołodziejczyk, Waldemar, Kaleta, Mariusz

arXiv.org Artificial IntelligenceSep-15-2024

This paper explores the application of Reinforcement Learning (RL) to the two-dimensional rectangular packing problem. We propose a reduced representation of the state and action spaces that allow us for high granularity. Leveraging UNet architecture and Proximal Policy Optimization (PPO), we achieved a model that is comparable to the MaxRect heuristic. However, our approach has great potential to be generalized to nonrectangular packing problems and complex constraints.

artificial intelligence, machine learning, reinforcement learning, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.17388/WUT.2024.0002.MiNI

2409.09677

Country:

Europe > Poland > Masovia Province > Warsaw (0.05)
Asia > China (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

The generator gradient estimator is an adjoint state method for stochastic differential equations

Badolle, Quentin, Gupta, Ankit, Khammash, Mustafa

arXiv.org Machine LearningJul-29-2024

Motivated by the increasing popularity of overparameterized Stochastic Differential Equations (SDEs) like Neural SDEs, Wang, Blanchet and Glynn recently introduced the generator gradient estimator, a novel unbiased stochastic gradient estimator for SDEs whose computation time remains stable in the number of parameters. In this note, we demonstrate that this estimator is in fact an adjoint state method, an approach which is known to scale with the number of states and not the number of parameters in the case of Ordinary Differential Equations (ODEs). In addition, we show that the generator gradient estimator is a close analogue to the exact Integral Path Algorithm (eIPA) estimator which was introduced by Gupta, Rathinam and Khammash for a class of Continuous-Time Markov Chains (CTMCs) known as stochastic chemical reactions networks (CRNs).

adjoint state method, estimator, generator gradient estimator, (9 more...)

arXiv.org Machine Learning

2407.20196

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Mathematics of Computing (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Approximate Dec-POMDP Solving Using Multi-Agent A*

Koops, Wietze, Junges, Sebastian, Jansen, Nils

arXiv.org Artificial IntelligenceMay-9-2024

We present an A*-based algorithm to compute policies for finite-horizon Dec-POMDPs. Our goal is to sacrifice optimality in favor of scalability for larger horizons. The main ingredients of our approach are (1) using clustered sliding window memory, (2) pruning the A* search tree, and (3) using novel A* heuristics. Our experiments show competitive performance to the state-of-the-art. Moreover, for multiple benchmarks, we achieve superior performance. In addition, we provide an A* algorithm that finds upper bounds for the optimum, tailored towards problems with long horizons. The main ingredient is a new heuristic that periodically reveals the state, thereby limiting the number of reachable beliefs. Our experiments demonstrate the efficacy and scalability of the approach.

algorithm, dec-pomdp, window memory, (16 more...)

arXiv.org Artificial Intelligence

2405.05662

Country:

Europe > Slovenia (0.04)
Europe > Netherlands > Gelderland > Nijmegen (0.04)
Europe > Germany (0.04)

Genre: Research Report (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

Supplementing Gradient-Based Reinforcement Learning with Simple Evolutionary Ideas

Khadilkar, Harshad

arXiv.org Artificial IntelligenceMay-10-2023

We present a simple, sample-efficient algorithm for introducing large but directed learning steps in reinforcement learning (RL), through the use of evolutionary operators. The methodology uses a population of RL agents training with a common experience buffer, with occasional crossovers and mutations of the agents in order to search efficiently through the policy space. Unlike prior literature on combining evolutionary search (ES) with RL, this work does not generate a distribution of agents from a common mean and covariance matrix. Neither does it require the evaluation of the entire population of policies at every time step. Instead, we focus on gradient-based training throughout the life of every policy (individual), with a sparse amount of evolutionary exploration. The resulting algorithm is shown to be robust to hyperparameter variations. As a surprising corollary, we show that simply initialising and training multiple RL agents with a common memory (with no further evolutionary updates) outperforms several standard RL baselines.

evolutionary algorithm, machine learning, reinforcement learning, (13 more...)

arXiv.org Artificial Intelligence

2305.07571

Country: Asia > India > Maharashtra > Mumbai (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Add feedback

Using Contrastive Samples for Identifying and Leveraging Possible Causal Relationships in Reinforcement Learning

Khadilkar, Harshad, Meisheri, Hardik

arXiv.org Artificial IntelligenceOct-28-2022

A significant challenge in reinforcement learning is quantifying the complex relationship between actions and long-term rewards. The effects may manifest themselves over a long sequence of state-action pairs, making them hard to pinpoint. In this paper, we propose a method to link transitions with significant deviations in state with unusually large variations in subsequent rewards. Such transitions are marked as possible causal effects, and the corresponding state-action pairs are added to a separate replay buffer. In addition, we include \textit{contrastive} samples corresponding to transitions from a similar state but with differing actions. Including this Contrastive Experience Replay (CER) during training is shown to outperform standard value-based methods on 2D navigation tasks. We believe that CER can be useful for a broad class of learning tasks, including for any off-policy reinforcement learning algorithm.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

2210.17296

Country: North America > United States > Massachusetts > Hampshire County > Amherst (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Experiments in Underwater Feature Tracking with Performance Guarantees Using a Small AUV

Biggs, Benjamin, He, Hans, McMahon, James, Stilwell, Daniel J.

arXiv.org Artificial IntelligenceOct-5-2022

We present the results of experiments performed using a small autonomous underwater vehicle to determine the location of an isobath within a bounded area. The primary contribution of this work is to implement and integrate several recent developments real-time planning for environmental mapping, and to demonstrate their utility in a challenging practical example. We model the bathymetry within the operational area using a Gaussian process and propose a reward function that represents the task of mapping a desired isobath. As is common in applications where plans must be continually updated based on real-time sensor measurements, we adopt a receding horizon framework where the vehicle continually computes near-optimal paths. The sequence of paths does not, in general, inherit the optimality properties of each individual path. Our real-time planning implementation incorporates recent results that lead to performance guarantees for receding-horizon planning.

artificial intelligence, machine learning, real time system, (21 more...)

arXiv.org Artificial Intelligence

2210.02524

Country:

North America > United States > Virginia > Montgomery County > Blacksburg (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > District of Columbia > Washington (0.04)
(4 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Robots (0.91)
Information Technology > Architecture > Real Time Systems (0.75)
(2 more...)

Add feedback

Computational Benefits of Intermediate Rewards for Goal-Reaching Policy Learning

Zhai, Yuexiang, Baek, Christina, Zhou, Zhengyuan, Jiao, Jiantao, Ma, Yi

Journal of Artificial Intelligence ResearchMar-12-2022

Many goal-reaching reinforcement learning (RL) tasks have empirically verified that rewarding the agent on subgoals improves convergence speed and practical performance. We attempt to provide a theoretical framework to quantify the computational benefits of rewarding the completion of subgoals, in terms of the number of synchronous value iterations. In particular, we consider subgoals as one-way intermediate states, which can only be visited once per episode and propose two settings that consider these one-way intermediate states: the one-way single-path (OWSP) and the one-way multi-path (OWMP) settings. In both OWSP and OWMP settings, we demonstrate that adding intermediate rewards to subgoals is more computationally efficient than only rewarding the agent once it completes the goal of reaching a terminal state. We also reveal a trade-off between computational complexity and the pursuit of the shortest path in the OWMP setting: adding intermediate rewards significantly reduces the computational complexity of reaching the goal but the agent may not find the shortest path, whereas with sparse terminal rewards, the agent finds the shortest path at a significantly higher computational cost. We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and some popular deep RL algorithms.

assumption 3, intermediate reward, intermediate state, (13 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.1.13326

AI Access Foundation

13326

Journal of Artificial Intelligence Research

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Industry:

Leisure & Entertainment > Games (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Computational Benefits of Intermediate Rewards for Hierarchical Planning

Zhai, Yuexiang, Baek, Christina, Zhou, Zhengyuan, Jiao, Jiantao, Ma, Yi

arXiv.org Artificial IntelligenceJul-8-2021

Many hierarchical reinforcement learning (RL) applications have empirically verified that incorporating prior knowledge in reward design improves convergence speed and practical performance. We attempt to quantify the computational benefits of hierarchical RL from a planning perspective under assumptions about the intermediate state and intermediate rewards frequently (but often implicitly) adopted in practice. Our approach reveals a trade-off between computational complexity and the pursuit of the shortest path in hierarchical planning: using intermediate rewards significantly reduces the computational complexity in finding a successful policy but does not guarantee to find the shortest path, whereas using sparse terminal rewards finds the shortest path at a significantly higher computational cost. We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and other popular deep RL algorithms.

assumption 3, intermediate reward, intermediate state, (12 more...)

arXiv.org Artificial Intelligence

2107.03961

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.82)

Industry:

Leisure & Entertainment (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback